# Non-linear modelling

*2025-02-13*

For regression, the standard linear model:

$$Y = \alpha + \beta_1X_1 + \beta_2X_2 + \dots + \beta_jX_j + \epsilon$$

is a widely used approach to describe the relationship between a response variable $Y$ and a set of predictors $X_1, X_2, \dots, X_j$. Linear models are popular due to their simplicity, ease of implementation, and interpretability. The coefficients provide direct insights into the relationship between each predictor and the response, making it straightforward to infer the impact of individual variables.
```r
set.seed(321)

library(ggplot2)
library(patchwork)

x <- seq(-10, 10)
y <- 2 * x + rnorm(length(x), mean = 0, sd = 2)
z <- y^2 + rnorm(length(x), mean = 0, sd = 0.5)

# Generating some data
data <- data.frame(x = x, y = y, z = z)

# Plotting
p_1 <- data |>
  ggplot(aes(x = x, y = y)) +
  geom_point(size = 1.5, colour = "steelblue4") +
  labs(title = "Linear relationship between x and y") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 13))

p_2 <- data |>
  ggplot(aes(x = x, y = z)) +
  geom_point(colour = "steelblue4") +
  labs(title = "Non-linear relationship between x and y", y = "y") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 13))

p_1 + p_2 + plot_layout(axes = "collect")
```
Figure 1: Comparison of linear and nonlinear relationships. The left panel illustrates a linear relationship between x and y, where the data points are scattered around a straight line. The right panel demonstrates a nonlinear relationship between x and y, where the data points follow a quadratic pattern.
However, the standard linear model relies on the assumption of linearity, which is often an approximation of the true underlying relationship. In many real-world scenarios, this assumption may not hold (Figure 1), leading to significant limitations in predictive accuracy. When the true relationship is nonlinear, a linear model may fail to capture complex patterns, resulting in poor performance.
We can see this in the figure below, where our straight-line equation captures the data well in the left-hand plot but fails to do so in the right-hand plot.
```r
p_1 <- p_1 + geom_smooth(method = "lm", colour = "firebrick4", linewidth = 1.25, se = FALSE)
p_2 <- p_2 + geom_smooth(method = "lm", colour = "firebrick4", linewidth = 1.25, se = FALSE)

p_1 + p_2 + plot_layout(axes = "collect")
```

Figure 2: We can see that our linear model does not do a good job of capturing the quadratic relationship in our data.
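We can also quantify what the figure shows by fitting a straight line to each dataset and comparing the proportion of variance explained. A minimal sketch, re-simulating the data from the first code block:

```r
set.seed(321)
x <- seq(-10, 10)
y <- 2 * x + rnorm(length(x), mean = 0, sd = 2)
z <- y^2 + rnorm(length(x), mean = 0, sd = 0.5)

fit_linear    <- lm(y ~ x)  # truly linear relationship
fit_nonlinear <- lm(z ~ x)  # quadratic relationship

summary(fit_linear)$r.squared     # high: the line fits well
summary(fit_nonlinear)$r.squared  # low: the line misses the curvature
```

The straight line explains almost all of the variance in $y$, but very little of the variance in $z$, confirming the visual impression.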
As such, various extensions of the linear model have been developed that relax the assumption of linearity while maintaining as much interpretability as possible.
## Polynomial regression
Polynomial regression is a natural extension of the standard linear model, where the relationship between the predictor $X$ and the response $Y$ is modeled as an $n$-th degree polynomial. This allows the model to capture curvature in the data while still using a linear framework for estimation. The model takes the form:

$$Y = \alpha + \beta_1X + \beta_2X^2 + \dots + \beta_nX^n + \epsilon$$
Here, $X, X^2, \dots, X^n$ are the polynomial terms, and $\beta_1, \beta_2, \dots, \beta_n$ are their corresponding coefficients. Despite the inclusion of nonlinear terms, polynomial regression remains linear in terms of estimation because the regression function is linear in the unknown parameters $(\alpha, \beta_1, \dots, \beta_n)$. This key property allows us to leverage the techniques of multiple linear regression for estimation and inference. Specifically, polynomial regression can be implemented by treating $X, X^2, \dots, X^n$ as distinct independent variables in a multiple regression framework. As a result, the computational and inferential challenges of polynomial regression can be fully addressed using well-established methods, such as ordinary least squares.
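To make this concrete, the quadratic model can be fitted with ordinary least squares simply by passing $X$ and $X^2$ as two predictors to `lm()`. A minimal sketch, re-simulating the data from the first code block (`I(x^2)` supplies the raw squared term, whereas `poly(x, 2)` would use orthogonal polynomials):

```r
set.seed(321)
x <- seq(-10, 10)
y <- 2 * x + rnorm(length(x), mean = 0, sd = 2)
z <- y^2 + rnorm(length(x), mean = 0, sd = 0.5)

# Treat x and x^2 as two separate predictors in a multiple regression;
# I() protects x^2 from being interpreted as formula syntax
fit_poly <- lm(z ~ x + I(x^2))
coef(fit_poly)  # estimates of alpha, beta_1, beta_2
```

Since $z$ was simulated as roughly $(2x)^2 = 4x^2$ plus noise, the coefficient on the squared term should come out close to 4.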
Let’s refit a model to our plot, but this time using a $2^{\text{nd}}$ degree polynomial. Hence, we are fitting a model that looks like:

$$Y = \alpha + \beta_1X + \beta_2X^2 + \epsilon$$

```r
p_2 + geom_smooth(formula = y ~ poly(x, 2), method = "lm", colour = "forestgreen", linewidth = 1.25, se = FALSE)
```
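Beyond the visual check, we can ask whether the squared term significantly improves on the straight-line fit by comparing the two nested models with an F-test via `anova()`. A quick sketch, re-simulating the data from the first code block:

```r
set.seed(321)
x <- seq(-10, 10)
y <- 2 * x + rnorm(length(x), mean = 0, sd = 2)
z <- y^2 + rnorm(length(x), mean = 0, sd = 0.5)

fit_linear <- lm(z ~ x)           # straight line
fit_quad   <- lm(z ~ x + I(x^2))  # 2nd degree polynomial

# F-test for the nested models: a small p-value indicates that the
# squared term explains significantly more variance
anova(fit_linear, fit_quad)
```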